Reflection
The end of the line
It's been a long journey, but we're proud of the search engine we've created. We started this assignment with one goal: create an indexer that correctly parses the inconsistent, error-prone XML files. We wanted to focus mainly on this indexer and decided to give the actual search engine a lower priority. However, when we finally finished the indexer after dozens of hours of programming, analyzing and improving, we found that we had gained so much experience with ElasticSearch that it would be a shame not to create an awesome search engine too! And that's what we did. We've created a search engine that works well and goes far beyond the original assignment. It took us a lot longer than we expected, and for the last two weeks there have been few evenings we didn't spend with ElasticSearch, but the result is worth it. This page gives a quick summary of the extra work we've done, and also provides some closing comments.

Extra features
We have implemented a lot of extra features (features that the assignment didn't require). What follows is a short summary of the most important ones.

Keyword in context descriptions
Instead of just displaying static document descriptions, we've implemented keyword in context (KWIC) descriptions. An algorithm selects the three sentences that contain the most matching query words and returns these to the user, while highlighting the corresponding query words. We've also implemented KWIC titles. As described by Manning et al. (2008), the KWIC descriptions make it much easier for the user to quickly determine whether a file is relevant.
Reference: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1, p. 496). Cambridge: Cambridge University Press.

Query optimization
We've implemented many algorithms to make the returned results as relevant as possible.
For example, we use a snowball stemmer to increase recall, convert all words to lowercase, remove Dutch stop words and apply ASCII folding.

Wikipedia Tagging
We've used the entities that were provided in the XML files to implement Wikipedia tagging. Because the tagging isn't very accurate, we've decided to disable it by default, but it's still a nice little feature that's sometimes surprisingly handy.

Faceted search
We've implemented a lot of faceted search options. After seeing the search results, the user can refine these results by selecting a date range, selecting a sender, or clicking on a year in the bar chart.

Advanced search operations
Because our search engine is very topic-specific, we expect it to be used by experts too. That's why we've implemented a lot of professional query operators. The supported advanced search operations are: wildcards, regular expressions, fuzziness, proximity searches, phrase queries, boosting, boolean operators and grouping.

Experimenting with the data
Instead of just implementing one algorithm for filtering the tag cloud, we've implemented two: TF-IDF and Mutual Information. Because TF-IDF works best, we've enabled that one by default. However, we allow the user to switch to Mutual Information on the fly. This allows the user to experiment with the different algorithms and interactively explore the data.

Pagination
As Hearst described in her book, it's not wise to show more than ten results on one page. That's why we decided to implement pagination. It works well and you can easily switch between pages.

A very advanced indexing algorithm
Our indexing algorithm is complex. It took us a long time to create an indexer that automatically recognizes the different parts of a document, especially given the challenges that we've described in Indexing the Collection. This isn't just an indexer that parses XML tags: it goes way beyond that.
If you take a look at our code (both the Python file and the PHP script), you can see that we had to implement a lot of extra algorithms to correctly filter, parse and display the documents. We hope you'll realize how much time went into this and that you'll appreciate it.

Closing remarks
We really enjoyed the experience of creating our own search engine from scratch. It was a lot of work and, as with all programming assignments, it was quite frustrating at times. In the end, however, we're really proud of what we've created. We've learned how to create an advanced indexer that parses and analyzes a file, and we've learned how to create a search engine that offers advanced search, faceted search and filtering, among other things. We've really enjoyed learning how to create a search engine with ElasticSearch, and we believe these skills will be very useful for us later in our careers.
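
Illustration: keyword in context selection
To make the keyword in context descriptions from the extra features more concrete, here is a minimal Python sketch of the idea: score each sentence by how many query words it contains, keep the three best, and highlight the matches. It is an illustration under assumptions, not the project's actual code. The function name `kwic_description`, the naive regex sentence splitting, counting every occurrence of a query word, and the `<b>` highlight markup are all hypothetical choices.

```python
import re


def kwic_description(text, query, max_sentences=3):
    """Return up to max_sentences sentences with the most query-word
    matches, with each matched word wrapped in <b> tags (assumed markup)."""
    # Lowercase the query words so that matching is case-insensitive.
    terms = {t.lower() for t in re.findall(r"\w+", query)}
    if not terms:
        return []

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Score every sentence by the number of query-word occurrences in it.
    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        score = sum(1 for w in words if w in terms)
        if score > 0:
            scored.append((score, position, sentence))

    # Highest score first; ties are resolved by document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    best = scored[:max_sentences]

    # Highlight whole-word matches of any query term.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in sorted(terms)) + r")\b",
        re.IGNORECASE,
    )
    return [pattern.sub(r"<b>\1</b>", sentence) for _, _, sentence in best]


if __name__ == "__main__":
    sample = (
        "The minister sent a letter. The letter concerned the budget. "
        "Nothing else happened. The budget letter was late."
    )
    for line in kwic_description(sample, "budget letter"):
        print(line)
```

A real implementation would probably match on stemmed and ASCII-folded tokens so that the highlighting agrees with what the search index matched. The sketch compares raw lowercase words only.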